Naive Forecasts

Week 2

Evie Zhang

Old Dominion University

Topics

  1. \(E[y]\), \(\hat{y}\)
  2. Loss Functions
  3. Lags, Leads
  4. Conditional Forecasts
  5. Time Series Components

International Air Passengers

Code
library(pacman)
p_load(tidyverse, data.table, ggplot2, lubridate)
data("AirPassengers")
air <- AirPassengers; rm(AirPassengers)
range(air)
[1] 104 622
  • What would be a good forecast, and why?

  • What is the “goal” or “objective”?

International Air Passengers

Does visualizing the data change your mind?

Code
ggplot(data.frame(air), aes(x = air)) + 
  geom_density(fill = "lightblue", color = "lightblue") + 
  labs(x = "Count of Passengers", y = "Density", title = "") +
  theme_minimal() +  # Simpler background theme with x and y axis
  theme(
    plot.title = element_text(hjust = 0.5),  # Centering title
    axis.text.x = element_text(angle = 0, hjust = 1, color = "black"),
    axis.text.y = element_text(color = "black")
  )

International Air Passengers

  • Point Forecast: a best guess (\(\hat{y}\)) for an unknown value.

  • Forecast Distribution: a distribution, or range, of possibilities and their corresponding likelihoods/probabilities.

  • Point Forecast (again): a summary statistic of a forecast distribution

    • How do we decide on the appropriate summary measure for our point forecast?
    • Loss Function

Forecast Error

\[ e = y - \hat{y} \]

Will there always be forecast error?

So long as \(y\) is not deterministic, yes!

Of course, some errors are worse than others.

\[ L(e): \text{Loss Function} \]

Loss Functions

What are some examples of Loss Functions?

Absolute Loss

\[L(e) = |e| \]

Quadratic Loss

\[L(e) = e^2 \]

What are some features (good / bad) about these Loss Functions?
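One way to see the difference is to apply both losses to the same errors. A minimal base-R sketch with two made-up errors, one small and one large:

```r
# Two hypothetical forecast errors: one small, one large
e <- c(-2, 10)

abs_loss  <- abs(e)  # absolute loss: 2, 10
quad_loss <- e^2     # quadratic loss: 4, 100

# Relative penalty the large error receives vs. the small one:
abs_loss[2] / abs_loss[1]    # 5 times the small error
quad_loss[2] / quad_loss[1]  # 25 times the small error
```

Quadratic loss penalizes large errors disproportionately (and is differentiable everywhere), while absolute loss penalizes proportionally and is less sensitive to outliers.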

Risk

Forecasts seek to minimize risk. Risk is defined as the expected loss:

\[R(\hat{y}) = E[L(e)] = E[L(y - \hat{y})]\]

For quadratic loss:

\[ E[(y - \hat{y})^2] = E[y^2] - 2\hat{y}E[y] + \hat{y}^2\]

Remember, we know \(E[y]\). From here, we can minimize this function with respect to \(\hat{y}\).

\[\frac{d}{d\hat{y}}\left(E[y^2] - 2\hat{y}E[y] + \hat{y}^2\right) = -2E[y] + 2\hat{y} = 0\]

After simplifying, we’ll have \(\hat{y} = E[y]\). So, the \(\hat{y}\) that minimizes our risk is the mean of the distribution of \(y\).
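We can also verify this result numerically. A quick sketch with simulated data: minimize the empirical risk over candidate values of \(\hat{y}\) and compare the answer to the sample mean.

```r
set.seed(1)
y <- rnorm(10000, mean = 5, sd = 2)  # simulated draws of y

# Empirical risk under quadratic loss for a candidate forecast yhat
risk <- function(yhat) mean((y - yhat)^2)

# The numerical minimizer lands (approximately) on the sample mean
opt <- optimize(risk, interval = range(y))
c(optimal = opt$minimum, sample_mean = mean(y))
```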

Expectations

So, if \(\hat{y}^* = E[y]\), how do we calculate \(E[y]\)?

Continuous:

  • \(\int x f(x) \,dx\), where \(f(x)\) is the density of \(x\)

Discrete:

  • \(\sum xP(x)\)

Empirical:

  • \(\frac{1}{n}\sum_{i = 1}^{n} x_i\)

Expectation – Weighted Coin

Suppose you flip a weighted coin with the following distribution:

\[P(y = T) = \frac{3}{7}\] \[P(y = H) = \frac{4}{7}\]

If the coin comes up tails, you are given $40. If the coin comes up heads, you owe $25.

What is the expected value of the coin flip? What is the expected “risk” (i.e. \(E[L(e)]\))?
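As a check on your answer, the discrete expectation formula can be applied directly (a minimal sketch; payoffs and probabilities from the slide):

```r
payoff <- c(T = 40, H = -25)  # tails wins $40, heads loses $25
prob   <- c(T = 3/7, H = 4/7)

# Discrete expectation: sum of x * P(x)
ev <- sum(payoff * prob)
ev  # 20/7, about 2.86

# Expected quadratic loss when forecasting ev (i.e., the variance)
risk <- sum((payoff - ev)^2 * prob)
risk  # about 1034.7
```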

International Air Passengers, Part 2

Let’s use our shiny new optimal forecast.

Code
# Convert the time series to a tibble
air_df <- data.frame(
  date = seq(as.Date("1949-01-01"), as.Date("1960-12-01"), by = "months"),
  count = as.numeric(air)
)

# Create a tibble for air2
air2_df <- tibble(date = seq(as.Date("1961-01-01"), as.Date("1962-12-01"), by = "months"),
                  count = mean(air))

# Plot using ggplot2
ggplot(air_df, aes(x = date, y = count)) +
  geom_line() +
  geom_line(data = air2_df, aes(x = date, y = count), col = "tomato", lwd = 1, linetype = "dashed") +
  xlim(as.Date(c("1949-01-01", "1963-01-01"))) +
  labs(x = "Date", y = "Count of Passengers") +
  theme_minimal()

This looks terrible, but let’s continue for now.

Forecast Interval

If our variable of interest is normally distributed, we can calculate a forecast interval for our predicted outcome.

\[ [\mu - \sigma Z_{\alpha/2}, \mu + \sigma Z_{\alpha/2}] \]

\[ [\hat{y} - \sigma Z_{\alpha/2}, \hat{y} + \sigma Z_{\alpha/2}] \]

\[ [280.3 - 120 \times 1.96, 280.3 + 120 \times 1.96] \]

\[ [45.1, 515.5] \]

International Air Passengers, Part 3

Code
# Create a sequence of dates for 1961-01 to 1962-12
dates <- seq(as.Date("1961-01-01"), by = "month", length.out = 24)

# Calculate the mean value and the standard deviation for the entire 'air' dataset
air_mean <- mean(air)
air_sd <- sd(air)

# Construct the dataframe
air2_df <- tibble(
  date = dates,
  mean = rep(air_mean, times = 24),  # The mean is a constant for every row
  upper = rep(air_mean + (1.96 * air_sd), times = 24),
  lower = rep(air_mean - (1.96 * air_sd), times = 24)
)

# Plot using ggplot2
ggplot(air_df, aes(x = date, y = count)) +
  geom_line() +
  geom_line(data = air2_df, aes(y = mean), col = "tomato", lwd = 1, linetype = "dashed") +
  geom_line(data = air2_df, aes(y = upper), col = "tomato", lwd = 1) +
  geom_line(data = air2_df, aes(y = lower), col = "tomato", lwd = 1) +
  labs(x = "Time", y = "Count of Passengers") + 
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.5),  # Centering title
    axis.text.x = element_text(angle = 0, hjust = 1, color = "black"),
    axis.text.y = element_text(color = "black")
  )

How to Code

Code
# Create a sequence of dates
timez <- seq(as_date("1949-01-01"), as_date("1962-12-01"), by = "month")

# Create the air3 tibble
air3 <- tibble(
  time = timez,
  air = c(as.vector(air), rep(NA, 24))
)

# Fit a constant model
reg <- lm(air ~ 1, data = air3)

# Add the predictions and confidence intervals to air3
air3 <- air3 %>%
  mutate(
    pred = predict(reg, newdata = air3),
    pred_l = pred - (1.96 * sd(residuals(reg))),
    pred_u = pred + (1.96 * sd(residuals(reg)))
  )

# View the first few rows of the tibble
head(air3, 3)
# A tibble: 3 × 5
  time         air  pred pred_l pred_u
  <date>     <dbl> <dbl>  <dbl>  <dbl>
1 1949-01-01   112  280.   45.2   515.
2 1949-02-01   118  280.   45.2   515.
3 1949-03-01   132  280.   45.2   515.

How to Code

Code
## make the plot
ggplot(air3, aes(x = time, y = air)) +
  geom_line() +
  geom_line(data = filter(air3, time > ymd("1960-12-01")), 
            aes(y = pred), col = "tomato", linetype = "dashed") +
  geom_line(data = filter(air3, time > ymd("1960-12-01")), 
            aes(y = pred_l), col = "tomato") +
  geom_line(data = filter(air3, time > ymd("1960-12-01")), 
            aes(y = pred_u), col = "tomato") +
  coord_cartesian(xlim = as.Date(c("1949-01-01", "1963-01-01")), 
                  ylim = c(min(air3$pred_l, na.rm = TRUE), max(air3$air, na.rm = TRUE))) +
  labs(x = "", y = "Count of Passengers") +
  theme_minimal() + 
  theme(
    plot.title = element_text(hjust = 0.5),  # Centering title
    axis.text.x = element_text(angle = 0, hjust = 1, color = "black"),
    axis.text.y = element_text(color = "black")
  )

Time

Everything we’ve talked about thus far has been as true for cross-sectional data as for time series data.

Let’s talk about time now.

Cross Section vs Time

  • Cross section units are often denoted with an \(i\), indexing the individual unit.
    • \(Y_i\), \(X_i\)
  • Time series units are denoted with a \(t\), representing time.
    • \(Y_t\), \(X_t\)
  • Panel units are denoted with a both \(i\) and \(t\).
    • \(Y_{it}\), \(X_{it}\)

Time

Time, or rather periodicity, can come in many forms:

  • Year
  • Month
  • Day
  • Minute
  • Transaction, or an instance

This is called frequency.

Leads, Lags

A common feature of time series data is intertemporal dependency. For this, we have lags and leads.

Lags: \(y_{t-1}\), \(y_{t-2}\), \(y_{t-k}\)

Leads: \(y_{t+1}\), \(y_{t+2}\), \(y_{t+k}\)

Sample: \(\{y_{1}, y_{2}, y_{3}, ..., y_{T}\}\)
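In code, a lag shifts the series back one period and a lead shifts it forward, padding the ends with NA. A base-R sketch (dplyr's `lag()` and `lead()` do the same thing):

```r
y <- c(112, 118, 132, 129, 121)  # first few AirPassengers values

# y_{t-1}: shift back one period; y_{t+1}: shift forward one period
lag1  <- c(NA, y[-length(y)])
lead1 <- c(y[-1], NA)

data.frame(t = 1:5, y = y, lag1 = lag1, lead1 = lead1)
```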

Forecast Notation

Periods:

Sample: \(\{y_{1}, y_{2}, y_{3}, ..., y_{T}\}\)

Out of Sample: \(\{y_{T+1}, y_{T+2}, ..., y_{T+h}\}\) where \(h\) is the horizon

Notation:

\(\hat{y}\) : \(y\) \(\implies\) \(\hat{y}_t\) : \(y_t\)

\(\hat{y}_t\) : \(y_t\) \(\implies\) \(\hat{y}_{T+h}\) : \(y_{T+h}\)

However, none of these make the conditioning information explicit.

\(\hat{y}_{T+h|t}\), \(y_{T+h|t}\): the forecast (and value) of \(y\) at period \(T+h\), given information available at time \(t\)

Conditioning

  1. Conditioning on relevant variables will improve the forecast (i.e., reduce risk)

  2. Both point forecasts and forecast intervals are functions of conditioning variables.

Let’s call all available information \(\Omega_t\).

Conditioning Examples

Code
library(gridExtra)

b <- read.csv("../data/bball_allstars.csv")

# Plot for all basketball players
p1 <- ggplot(b, aes(x = HEIGHT)) +
  geom_density(fill = "dodgerblue", alpha = 0.6) +
  geom_vline(aes(xintercept = mean(HEIGHT)), color = "black") +
  coord_cartesian(xlim = c(66, 85), ylim = c(0, .12)) +
  labs(title = "Height of Pro Basketball Players",
       x = paste("Mean:", round(mean(b$HEIGHT), 1)),
       y = "Density") +
  theme_minimal()

# Plot separating by League
p2 <- ggplot(b, aes(x = HEIGHT, fill = LEAGUE)) +
  geom_density(alpha = 0.5) +
  geom_vline(aes(xintercept = mean(HEIGHT[LEAGUE == "WNBA"])), color = "tomato", linetype = "dashed") +
  geom_vline(aes(xintercept = mean(HEIGHT[LEAGUE == "NBA"])), color = "dodgerblue", linetype = "dashed") +
  geom_vline(aes(xintercept = mean(HEIGHT)), color = "black") +
  coord_cartesian(xlim = c(66, 85), ylim = c(0, .12)) +
  labs(title = "Height of Pro Basketball Players",
       x = paste("Mean WNBA:", round(mean(b$HEIGHT[b$LEAGUE == "WNBA"]), 1), "| Mean NBA:", round(mean(b$HEIGHT[b$LEAGUE == "NBA"]), 1)),
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("WNBA" = "tomato", "NBA" = "dodgerblue")) +
  theme(legend.position = "top")

# Print plots
grid.arrange(p1, p2, ncol=2)

Code
# Adding the 'letter' column to the data
b <- b %>% 
  mutate(letter = ifelse(toupper(substr(PLAYER, 1, 1)) %in% LETTERS[1:10], "A-J", "K-Z"))

# Combine data for plotting
p1 <- ggplot(b, aes(x = HEIGHT)) +
  geom_density(fill = "dodgerblue", alpha = 0.6) +
  geom_vline(aes(xintercept = mean(HEIGHT)), linetype = 2, color = "black") +
  labs(
    title = "Height of Pro Basketball Players",
    x = paste("Mean:", round(mean(b$HEIGHT), 1)),
    y = "Density"
  ) +
  coord_cartesian(xlim = c(65, 90), ylim = c(0, 0.12)) +
  theme_minimal()

p2 <- ggplot(b, aes(x = HEIGHT, fill = letter)) +
  geom_density(alpha = 0.6) + # semi-transparent fills so both densities remain visible
  geom_vline(aes(xintercept = mean(HEIGHT)), linetype = 2, color = "black") +
  geom_vline(data = filter(b, letter == "A-J"), aes(xintercept = mean(HEIGHT)), color = "dodgerblue", linetype = 2) +
  geom_vline(data = filter(b, letter == "K-Z"), aes(xintercept = mean(HEIGHT)), color = "tomato", linetype = 2) +
  labs(
    title = "Height of Pro Basketball Players",
    x = paste("Mean A-J:", round(mean(b$HEIGHT[b$letter == "A-J"]), 1), "| Mean K-Z:", round(mean(b$HEIGHT[b$letter == "K-Z"]), 1)),
    y = "Density"
  ) +
  scale_fill_manual(values = c("dodgerblue", "tomato")) +
  coord_cartesian(xlim = c(65, 90), ylim = c(0, 0.12)) +
  theme_minimal() +
  theme(legend.position = "top")

# Print plots
grid.arrange(p1, p2, ncol=2)

Time Series Components

Just like with cross sectional settings, we may (or may not) observe some additional variables. However, we still need to model the Data Generating Process (DGP).

Data are not a choice, per se, but you do choose functional forms and specifications.

Once we select a model, we need to estimate the model’s parameters with data.

\[y_t = f(y_{t-1}, x_{t}, x_{t-1}, \tau_t, C_t, S_t)\]

Time Series Components

\[y_t = f(y_{t-1}, x_{t}, x_{t-1}, \tau_t, C_t, S_t)\]

\[y_t = \tau_t + C_t + S_t\]

  • \(\tau\): Trend
  • \(C\): Cycle
  • \(S\): Season
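The additive decomposition above can be computed directly with base R's `decompose()`, which estimates a moving-average trend, an average seasonal pattern, and a remainder (one classical implementation; `stl()` is a common alternative):

```r
data("AirPassengers")

# Classical additive decomposition: y_t = trend + seasonal + remainder
dec <- decompose(AirPassengers, type = "additive")

head(dec$seasonal, 12)  # the repeating monthly pattern
plot(dec)               # panels for observed, trend, seasonal, and random
```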

Components

Trend

  • Very long term; Smooth

Cycle

  • Repeats 2-7 years; Business Cycle

Season

  • Annual patterns

It is useful to consider these components separately (additively).

Components

Code
air_df <- tibble(date = seq(as.Date("1949-01-01"), as.Date("1960-12-01"), by = "months"),
                 count = as.numeric(air))


# Plot using ggplot2
ggplot(air_df, aes(x = date, y = count)) +
  geom_line() +
  labs(x = "Month", y = "Count of Passengers") +
  theme_minimal() +
  theme(legend.title = element_blank(),
        legend.position = "top",
        legend.text = element_text(size = 12))

Mean Forecast

The simplest model has an intercept, but no trend, cycle, or season.

\[E[y_{t+h}|\Omega_t] = \beta_0\]

This is simple, but what type of setting would this be appropriate for?

Naive Forecast

Code
consume <- read.csv("../data/realpersonalconsumptionexpend_pc.csv")
consume$DATE <- ymd(consume$DATE)
Code
ggplot(consume, aes(x = DATE, y = PCEC96_PC1)) +
  geom_line(lwd = 1) +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_hline(yintercept = mean(consume$PCEC96_PC1), color = "dodgerblue", lwd = 1) +
  labs(
    x = "Month", 
    y = "Percent Change (Year over Year)",
    title = "Real Personal Consumption Expenditures (PCEC96)"
  ) +
  theme_minimal()

Code
consume_filtered <- consume %>% 
  filter(DATE < ymd("2020-01-01"))

ggplot(consume_filtered, aes(x = DATE, y = PCEC96_PC1)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_hline(yintercept = mean(consume_filtered$PCEC96_PC1), color = "dodgerblue", lwd = 1) +
  labs(
    x = "Month", 
    y = "Percent Change (Year over Year)",
    title = "Real Personal Consumption Expenditures (PCEC96)"
  ) +
  theme_minimal() +
  coord_cartesian(xlim = c(min(consume$DATE), max(consume$DATE)))  # setting xlim to the range of the original data

Code
consume_filtered <- consume %>%
  filter(DATE < ymd("2020-01-01"))

mean_m <- mean(consume_filtered$PCEC96_PC1)
sd_m <- sd(consume_filtered$PCEC96_PC1)

ggplot(consume, aes(x = DATE, y = PCEC96_PC1)) +
  geom_line(data = consume_filtered) +
  geom_hline(yintercept = 0, linetype = 2) +
  
  # Point forecast line
  geom_segment(aes(x = ymd("2020-01-01"), xend = max(consume$DATE), 
                   y = mean_m, yend = mean_m, linetype = "Point Forecast"), 
               color = "mediumseagreen", lwd = 1) +
  
  # Upper CI line
  geom_segment(aes(x = ymd("2020-01-01"), xend = max(consume$DATE), 
                   y = mean_m + 1.645 * sd_m, yend = mean_m + 1.645 * sd_m, linetype = "90% CI"), 
               color = "mediumseagreen", lwd = 1) +
  
  # Lower CI line
  geom_segment(aes(x = ymd("2020-01-01"), xend = max(consume$DATE), 
                   y = mean_m - 1.645 * sd_m, yend = mean_m - 1.645 * sd_m, linetype = "90% CI"), 
               color = "mediumseagreen", lwd = 1) +
  
  labs(
    x = "Month", 
    y = "Percent Change (Year over Year)",
    title = "Real Personal Consumption Expenditures (PCEC96)"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom") +
  coord_cartesian(xlim = c(min(consume$DATE), max(consume$DATE))) +
  
  # Add the legend
  scale_linetype_manual(
    name = "", 
    values = c("Point Forecast" = 1, "90% CI" = 3)
  )

Errors vs Residuals

Forecast Errors

\(e_t = y_{t+h} - E[y_{t+h}|\Omega_t] \implies y_{t+h} = E[y_{t+h}|\Omega_t] + e_t\)

Residuals

\(\hat{e}_t = y_{t+h} - \hat{y}_{t+h} = y_{t+h} - \hat{\beta}_0\)

It may be helpful to plot the residuals over time to see if there are any remaining patterns.
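For the intercept-only model, the residuals are just deviations from the sample mean, so plotting them is straightforward. A minimal sketch with the air data:

```r
data("AirPassengers")

# Residuals of the mean (intercept-only) forecast
res <- as.numeric(AirPassengers) - mean(AirPassengers)

# Any visible pattern is structure the model failed to capture
plot(res, type = "l", xlab = "Observation", ylab = "Residual")
abline(h = 0, lty = 2)
```

Here the residuals clearly retain the trend and seasonality, which is exactly what the mean forecast cannot capture.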

Next Class

  1. Forecast Variance

  2. Trend Models

    • Functional Forms
    • Intercept Breaks
    • Trend Breaks